Abstract

One of the most promising ways of improving the performance
of deep convolutional neural networks is by increasing the
number of convolutional layers. However, adding layers makes
training more difficult and computationally expensive. In
order to train deeper networks, we propose to add auxiliary
supervision branches after certain intermediate layers during
training. We formulate a simple rule of thumb to determine
where these branches should be added. The resulting deeply
supervised structure makes the training much easier and also
produces better classification results on ImageNet and the
recently released, larger MIT Places dataset.

1. Introduction 

In the most recent ILSVRC competition [7], it was
demonstrated [9, 10] that CNN accuracy can be improved
even further by increasing the network size: both the depth
(number of levels) and the width (number of units at each
level). On the downside, a bigger network means more
parameters, which makes back-propagation slower to converge
and the model more prone to overfitting [9, 10]. To overcome
this problem, Simonyan and Zisserman [9] propose to initialize
deeper networks with parameters of pre-trained shallower
networks. However, this pre-training is costly and the
parameters may be hard to tune. Szegedy et al. [10] propose
to add auxiliary classifiers connected to intermediate layers.
The intuitive idea behind these classifiers is to encourage
the feature maps at lower layers to be directly predictive of
the final labels, and to help propagate the gradients back
through the deep network structure. However, Szegedy
et al. [10] do not systematically address the questions of
where and how to add these classifiers. Independently, Lee
et al. [6] introduce a related idea of deeply supervised
networks (DSN) where auxiliary classifiers are added at all
intermediate layers and their “companion losses” are added to
the loss of the final layer. They show that this deep
supervision yields an improved convergence rate, but their
experiments are limited to a relatively shallow network
structure.

In this work, to train deeper networks more efficiently,
we also adopt the idea of adding auxiliary classifiers after
some of the intermediate convolutional layers. We give a
simple rule of thumb, motivated by studying vanishing
gradients in deep networks, for deciding where to add them.
We use our strategy to train models with 8 and 13
convolutional layers, which is deeper than the original
AlexNet [5] with 5 convolutional layers, though not as deep
as the networks of [9, 10], which feature 16 and 21
convolutional layers, respectively. Our results on
ImageNet [7] and the recently released larger MIT Places
dataset [13] confirm that deeper models are indeed more
accurate than shallower ones, and convincingly demonstrate
the promise of deep supervision as a training method.
Compared to the very deep GoogLeNet model trained on the MIT
Places dataset [1], a network with eight convolutional layers
trained with our method gives similar accuracy but with
faster feature extraction. Our model for the Places dataset
is released in the Caffe Model Zoo [2].

2. Our Method 

Since very deep models have only made their debut in
the 2014 ILSVRC contest, the problem of how to train them
is just beginning to be addressed. Simonyan and Zisserman [9]
initialize the lower convolutional layers of their deeper
networks with parameters obtained by training shallower
networks, and initialize the rest of the layers randomly.
While they achieve very good results, their training
procedure is slow and labor-intensive, since it relies on
training models of progressively increasing depth, and may
be very sensitive to implementation choices along the way.
Szegedy et al. [10] connect auxiliary classifiers (smaller
networks) to a few intermediate layers to provide additional
supervision during training. However, their report does not
give any general principles for deciding where these
classifiers should be added and what their structure should be.

Lee et al. [6] give a more comprehensive treatment of the
idea of adding supervision to intermediate network layers.
Their deeply-supervised nets (DSN) put an SVM classifier
on top of the outputs of each hidden layer; at training time,
they optimize a loss function that is a sum of the overall
(final-layer) loss and companion losses associated with all
intermediate layers. Our work has a number of differences
from [6]. First, they add supervision after each hidden layer,
while we decide where to add supervision using a simple
gradient-based heuristic described below. Second, the
classifiers of [6] are SVMs with a squared hinge loss. By
contrast, our supervision branch is more similar to that of
[10]: it is a small neural network composed of a convolutional
layer, several fully connected layers, and a softmax
classifier (see Fig. (1)). Since feature maps at the lower
convolutional layers are very noisy, to achieve good
performance, we have found it important to put them through
dimensionality reduction and discriminative non-linear
mapping before feeding them into a classifier.

Figure 1: Illustration of our deep models with 8 and 13 convolutional layers. The additional supervision loss branches
are indicated by dashed red boxes. $X_i$ denote the intermediate layer outputs and $W_j$ are the weight matrices for each
computational block. Blocks of the same type are shown in the same color. A legend below the network diagrams shows the
internal structure of the different block types.

Figure 2: (a) Mean gradient magnitude as a function of back-prop iterations for an 8-layer model trained on ImageNet with
“vanilla” initialization (see text); (b-d) Comparison of gradient values before and after adding auxiliary supervision after the
fourth convolutional layer. CNDS stands for our method: Convolutional Networks with Deep Supervision. For CNDS, after
a few thousand iterations, the gradient magnitudes stop growing and stay steady for a long time.

To decide where to add the supervision branches, we follow
an intuitive rule of thumb. First, we run a few (10-50)
iterations of back-propagation for the deep model with
supervision only at the final layer and plot the mean gradient
values of intermediate layers (using the standard
initialization for AlexNet [5]: weights are sampled from a
Gaussian with zero mean and std=0.01, and biases are set to 0).
Then we add supervision after the layer where the mean
gradient value vanishes (in our implementation, becomes less
than $10^{-7}$). As shown in Fig. (2), in our eight-layer
model, the gradients in the fourth convolutional layer tend to
vanish. Therefore, we add auxiliary supervision right after
this layer.
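The rule of thumb above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `mean_abs_gradients` and `supervision_layer` are hypothetical, the per-layer gradient values are made-up numbers rather than measurements, and "the layer where the gradient vanishes" is read here as the layer closest to the output whose mean gradient magnitude falls below the threshold.

```python
import numpy as np

def mean_abs_gradients(layer_grads):
    """Mean absolute gradient per layer; layer_grads is ordered input -> output."""
    return [float(np.mean(np.abs(g))) for g in layer_grads]

def supervision_layer(mean_grads, threshold=1e-7):
    """Return the 1-based index of the layer nearest the output whose mean
    gradient is below `threshold`, or None if no layer qualifies.
    (Assumed reading of the rule of thumb, not the authors' code.)"""
    for idx in range(len(mean_grads), 0, -1):
        if mean_grads[idx - 1] < threshold:
            return idx
    return None

# Hypothetical mean gradient magnitudes for conv1..conv8 after a few iterations.
example = [2e-9, 5e-9, 1e-8, 6e-8, 3e-6, 1e-5, 4e-5, 2e-4]
branch_after = supervision_layer(example)
```

With these illustrative values the scan stops at the fourth layer, which matches the placement reported for the eight-layer model.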

The top of Fig. (1) shows the resulting network structure,
consisting of a main and an auxiliary supervision
branch. In the main branch, weights $W_1, \ldots, W_{11}$ are
associated with the eight convolutional and three fully
connected layers. The auxiliary branch comes with weights
$W_{s5}, W_{s6}, W_{s7}, W_{s8}$. Let $W$ and $W_s$ denote the
concatenations of the two respective sets of parameters:

$$W = (W_1, \ldots, W_{11}),$$

$$W_s = (W_{s5}, \ldots, W_{s8}).$$

Given a training example that produces feature map $X_{11}$
at the softmax layer, for each possible label $k = 1, \ldots, K$
this layer computes the response

$$p_k = \frac{\exp(X_{11(k)})}{\sum_{k'=1}^{K} \exp(X_{11(k')})},$$

where $X_{11(k)}$ is the $k$th element of response $X_{11}$. The
associated loss for the entire network is

$$\mathcal{L}_0(W) = -\sum_{k=1}^{K} y_k \ln p_k,$$

where $y_k = 1$ if the example has label $k$ and 0 otherwise.

Analogously, given feature map $S_8$ before the softmax
layer of the auxiliary supervision branch, we have the output

$$p_{sk} = \frac{\exp(S_{8(k)})}{\sum_{k'=1}^{K} \exp(S_{8(k')})},$$

where $S_{8(k)}$ is the $k$th element of response $S_8$, and the
associated auxiliary loss

$$\mathcal{L}_s(W, W_s) = -\sum_{k=1}^{K} y_k \ln p_{sk}.$$

Note that this loss depends on $W$, not just $W_s$, because the
computation of the feature map $S_8$ involves the weights of
the early convolutional layers $W_1, \ldots, W_4$.

The combined loss function for the whole network is
given by a weighted sum of the main loss $\mathcal{L}_0(W)$ and
the auxiliary supervision loss $\mathcal{L}_s(W, W_s)$:

$$\mathcal{L}(W, W_s) = \mathcal{L}_0(W) + \alpha_t \mathcal{L}_s(W, W_s), \tag{1}$$

where $\alpha_t$ controls the trade-off between the two terms. In
the course of training, in order to use the second term
mainly as regularization, we adopt the same strategy as
in [6], where $\alpha_t$ decays as a function of epoch $t$ (with
$N$ being the total number of epochs):

$$\alpha_t \leftarrow \alpha_t \cdot (1 - t/N). \tag{2}$$
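As a minimal sketch of eqs. (1) and (2), the following Python computes the combined loss for a single example and applies the decay rule. The function names are hypothetical. The sketch also assumes that eq. (2) is an in-place update applied once per epoch and that epochs are counted from 1, neither of which is stated in the text.

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    """-sum_k y_k ln p_k for one example with a one-hot label."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-log_probs[label])

def combined_loss(main_logits, aux_logits, label, alpha):
    """Eq. (1): main loss plus the alpha-weighted auxiliary loss."""
    main = softmax_cross_entropy(main_logits, label)
    aux = softmax_cross_entropy(aux_logits, label)
    return main + alpha * aux

def decay_alpha(alpha, epoch, total_epochs):
    """Eq. (2): alpha <- alpha * (1 - t/N)."""
    return alpha * (1.0 - epoch / total_epochs)

# Illustrative schedule: start at 0.3 and update once per epoch (assumed).
alpha, total_epochs = 0.3, 5
schedule = []
for epoch in range(1, total_epochs + 1):
    alpha = decay_alpha(alpha, epoch, total_epochs)
    schedule.append(alpha)
```

Under this reading the auxiliary weight reaches zero in the final epoch, so the objective at the end of training reduces to the main loss alone.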

We train our deeply supervised model using stochastic
gradient descent. When doing back-propagation,
$W_5, \ldots, W_{11}$ are only affected by the main loss
$\mathcal{L}_0$. Similarly, $W_{s5}, \ldots, W_{s8}$ are only
affected by the auxiliary loss $\mathcal{L}_s$. However,
starting from $W_4$, where the gradients tend to vanish, the
$\mathcal{L}_s$ term successfully magnifies the gradients,
as can be seen from the before-and-after comparisons in
Fig. (2)(b-d).
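The dependence of each weight group on the two loss terms can be written out explicitly. The following block simply differentiates the combined loss of eq. (1) under the dependencies stated in the text (the auxiliary branch is attached after the fourth convolutional layer); it adds no assumptions beyond that.

```latex
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial W_i}
  &= \frac{\partial \mathcal{L}_0}{\partial W_i}
   + \alpha_t \frac{\partial \mathcal{L}_s}{\partial W_i},
  && i = 1, \ldots, 4, \\
\frac{\partial \mathcal{L}}{\partial W_i}
  &= \frac{\partial \mathcal{L}_0}{\partial W_i},
  && i = 5, \ldots, 11, \\
\frac{\partial \mathcal{L}}{\partial W_{sj}}
  &= \alpha_t \frac{\partial \mathcal{L}_s}{\partial W_{sj}},
  && j = 5, \ldots, 8.
\end{aligned}
```

Only the first four layers receive the extra term, which is where the main-loss gradient was observed to vanish.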

In addition to our 8-layer model, we have also experimented
with a 13-layer model (Fig. (1), middle). For this
model, gradients tend to decay every three to four layers,
and we get good results by putting the supervision branches
after the 10th, 7th, and 4th layers. All the auxiliary losses
have the same weight $\alpha_t$ (starting with 0.3 and then
decaying according to eq. (2)). We do not give the resulting
loss functions here, but their derivation is straightforward.
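For orientation, one plausible form of the 13-layer objective is sketched below, assuming it extends eq. (1) by summing one auxiliary loss per branch with the shared weight $\alpha_t$. The superscript notation $\mathcal{L}_s^{(j)}$ and $W_s^{(j)}$ for the branch attached after layer $j$ is introduced here for illustration and does not appear in the original text.

```latex
\mathcal{L}\bigl(W, W_s^{(4)}, W_s^{(7)}, W_s^{(10)}\bigr)
  = \mathcal{L}_0(W)
  + \alpha_t \sum_{j \in \{4, 7, 10\}} \mathcal{L}_s^{(j)}\bigl(W, W_s^{(j)}\bigr)
```

Each $\mathcal{L}_s^{(j)}$ would depend only on the main-branch weights up to layer $j$ and on its own branch weights, by the same argument as in the 8-layer case.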

In the following, we will refer to our training method as 
CNDS (Convolutional Networks with Deep Supervision). 

3. Experiments 

Sections 3.1 and 3.2 will present an evaluation of our
models trained on the two largest datasets currently
available: ImageNet (ILSVRC) [7] and MIT Places [13].

3.1. ImageNet Results 

We report results on the ILSVRC subset of ImageNet [3],
which includes 1000 categories and is split into 1.2M
training, 50K validation, and 100K testing images (the latter
have held-out class labels). The classification performance
is evaluated using top-1 and top-5 classification error. Top-5
error is the proportion of images such that the ground-truth
class is not within the top five predicted categories.
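The top-k error defined above can be computed as in the following sketch. The function name `top_k_error` and the toy scores are hypothetical and only illustrate the definition; they are not taken from the evaluation code used in the experiments.

```python
import numpy as np

def top_k_error(scores, labels, k=5):
    """Fraction of images whose true label is not among the k highest scores.

    scores: (num_images, num_classes) array of class scores.
    labels: (num_images,) array of ground-truth class indices.
    """
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    # Indices of the k largest scores per row (order inside the top k is irrelevant).
    top_k = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    hits = np.any(top_k == labels[:, None], axis=1)
    return float(1.0 - np.mean(hits))

# Toy example with 3 images and 4 classes (made-up numbers).
toy_scores = [[0.10, 0.60, 0.20, 0.10],
              [0.50, 0.10, 0.30, 0.10],
              [0.25, 0.15, 0.10, 0.50]]
toy_labels = [1, 2, 0]
top1 = top_k_error(toy_scores, toy_labels, k=1)
top2 = top_k_error(toy_scores, toy_labels, k=2)
```

In the toy example only the first image is classified correctly at top-1, while all three true labels fall within the top two predictions.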

All our ImageNet models are trained with Cuda-convnet2
(https://code.google.com/p/cuda-convnet2/), with training
length measured in epochs. Because Cuda-convnet2 supports
multi-GPU training, we can train deeper networks in a
reasonable time. We use the default Caffe [4] settings for
ImageNet training, that is, we crop one center and four corner
patches of size 227 x 227 pixels (out of 256 x 256) and do
horizontal flipping. We do not use model averaging or
multi-scale training/testing. Please see the supplementary
material for details of all the implementation settings for
our models.
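The patch geometry described above (one center and four corner crops of 227 x 227 from a 256 x 256 image, each with its horizontal mirror) can be sketched as follows. The function name is hypothetical, and the sketch only shows how the ten patches are cut; how they are sampled during training or combined at test time is not specified here and is left out.

```python
import numpy as np

def five_crops_with_flips(image, crop=227):
    """Return the four corner crops and the center crop of an H x W x C image,
    each followed by its horizontally flipped copy (10 patches in total)."""
    h, w = image.shape[:2]
    center_top, center_left = (h - crop) // 2, (w - crop) // 2
    offsets = [
        (0, 0),                      # top-left
        (0, w - crop),               # top-right
        (h - crop, 0),               # bottom-left
        (h - crop, w - crop),        # bottom-right
        (center_top, center_left),   # center
    ]
    patches = []
    for top, left in offsets:
        patch = image[top:top + crop, left:left + crop]
        patches.append(patch)
        patches.append(patch[:, ::-1])  # mirror along the width axis
    return patches

# Dummy 256 x 256 RGB image used only to exercise the function.
dummy = np.arange(256 * 256 * 3).reshape(256, 256, 3)
patches = five_crops_with_flips(dummy)
```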

First, we survey the top systems in the ILSVRC competitions
from 2012 to 2014. For models with five convolutional
layers, in the 2012 version of the contest, the highest results
were achieved by Krizhevsky et al. [5], who reported
40.7% top-1 and 18.2% top-5 error rates using a single net.
Subsequently, Zeiler and Fergus [12] obtained 36.0%
top-1 and 16.7% top-5 error rates by refining Krizhevsky’s
filters and combining six nets. In the 2013 competition,
Sermanet et al.’s OverFeat [8] obtained 34.5% top-1 and 13.2%
top-5 error rates by combining seven nets. In the 2014
competition, big progress was made by deeper models: the VGG
group [9] trained a series of deep models to get 25.5%
top-1 and 8.0% top-5 error, and a 22-layer GoogLeNet [10]
achieved a top-5 error rate of 6.7%. For a fair comparison,
Table 1(a) lists single-model results from each of these
systems.

In this work, we train networks with 8 and 13 convolutional
layers using deep supervision. First, as a baseline, we
train an ImageNet-CNN-8 model, which contains 8 convolutional
layers and 3 fully connected layers, using the strategy
in [9]: we first train a network with five convolutional
layers, and then we initialize the first five convolutional
layers and the last three fully connected layers of the deeper
network with the layers from the shallower network. The
other intermediate layers are initialized randomly. Including
the time for training the shallower network, ImageNet-CNN-8
takes around 6 days with 80 epochs on two NVIDIA
Tesla K40 GPUs with batch size 128.

Next, we train an ImageNet-CNDS-8 model using our
deep supervision method. This model is trained with
auxiliary supervision added after the fourth convolutional
layer as shown in Fig. (1). This model takes around 5 days to
train with 65 epochs on two K40 GPUs with batch size 128. The
weight $\alpha_t$ starts at 0.3 in all our CNDS training and
decays according to eq. (2). Apart from the initialization and
training procedure, we use the same network and parameter
settings for both ImageNet-CNDS-8 and ImageNet-CNN-8.
It should be noted that in the testing phase, the auxiliary
supervision branch of the CNDS model is cut off so it has
the same feedforward path as the CNN model.

In order to go deeper, we also train an ImageNet-CNDS-13
model with the structure shown in Fig. (1). ImageNet-CNDS-13
takes around 5 days on four GPUs using 67
epochs with batch size 128. Since initializing weights for
such a deep structure is in itself an open problem, we only
train it with our CNDS method.

Table 1(b) shows the top-1 and top-5 error rates of our
models on the validation set of ILSVRC. First, both our
8-layer models outperform state-of-the-art 5-layer models
from the literature, and the 13-layer model outperforms the
8-layer models. Therefore, “going deeper” really is an
effective way to improve classification accuracy. Second,
ImageNet-CNDS-8 is about 1% more accurate than
ImageNet-CNN-8 (33.8% vs. 34.7% top-1 error), while taking
less time to train. It is important to emphasize that both
models have the exact same structure at test time. Therefore,
we can think of deep supervision as a form of regularization
that gives better local minima for the classification task
(since $\alpha_t$ eventually decays to zero, at the end, the
loss we are optimizing is the original loss $\mathcal{L}_0$).

In absolute terms, our models are still less accurate than
the VGG models of the same depth. However, we believe
that this difference mainly comes from the network settings.
In particular, we use a stride of 2 and filter size of 7 at
the first layer, while they use a stride of 1 and filter size
of 3, which gives them a finer-grained representation; we
use single-scale training, while they use multi-scale.
However, all other factors being equal, deep supervision may be
a more promising strategy for training very deep networks
than the iterative deepening scheme of [9], since it is less
complex and takes less time to train.

3.2. Places Results 

ImageNet images are mainly object-centric, while the
recently released MIT Places dataset [13] is scene-centric.
For training of deep networks, a subset of Places,
called Places-205, has been created, which contains 2.4M
training images from 205 categories, with 5000-15000
images per category. The validation set contains 100 images
per category and the test set contains 200 images per
category (with held-out class labels). The training set of
Places is almost two times larger than the ILSVRC training set
and 60 times larger than the SUN dataset [11].

As a baseline, we use the five-layer net that was released
along with Places. It was trained using the Caffe package
on an NVIDIA Tesla K40 GPU. As reported in [13],
the process took 6 days and 300K iterations. We train two
models for comparison: Places-CNN-8 and Places-CNDS-8,
whose structure and parameters are the same as those of
ImageNet-CNN-8 and ImageNet-CNDS-8. The only difference
is that we train these models using Caffe instead
of Cuda-convnet2, to stay consistent with the pre-trained
Places model. Places-CNN-8 is trained the same way as
ImageNet-CNN-8, using a pre-trained five-layer network as
initialization. Including the pre-training time, Places-CNN-8
takes around 8 days with 240,000 iterations on a single
K40 GPU with batch size 256 (Caffe only allows single-GPU
training). Places-CNDS-8 takes around 6 days with
190,000 iterations. As for ImageNet, the weight $\alpha_t$
starts at 0.3 and decays according to eq. (2).

Following the evaluations in [13], we give the top-1
and top-5 accuracy on both the validation and test sets.
Note that for the test set, the ground-truth labels are
not released. Instead, we submitted our top-1 and top-5
predictions to the MIT testing website
(http://places.csail.mit.edu/submit.html) and received the
results automatically.

From Table 2, we can see that Places-CNN-8 and
Places-CNDS-8 both outperform the original Places-CNN-5.
Consistent with the results on ImageNet, our CNDS
model surpasses Places-CNN-8 by about 1%. Overall, with
a combination of a deeper model and deep supervision, we
achieve about 5% higher accuracy than the baseline number
reported in [13]. Our network comes close to the one trained
with the GoogLeNet structure [1] (55.7% vs. 56.3% top-1
accuracy on the test set), and has faster feature extraction
speed since it is not as deep.


4. Discussion 

This work has focused on the idea of training very deep
networks with auxiliary supervision inserted at intermediate
layers. We have attempted to formulate sound design
principles for where and how deep supervision should be
added. Our experiments have also shown the advantage
of this technique over alternative methods that require
pre-training of shallower networks [9]. Along the way, we have
reported new state-of-the-art results on the recently released
very large Places dataset [13].



     Method                      layers   top-1   top-5

(a)  Krizhevsky et al. [5]          5      40.7    18.2
     OverFeat [8]                   5      39.0    16.9
     Zeiler and Fergus [12]         5      37.5    16.0
     VGG [9]                        8      29.6    10.4
     VGG [9]                       13      27.0     8.8
     GoogLeNet [10]                21       -      10.07

(b)  ImageNet-CNN-8                 8      34.7    14.0
     ImageNet-CNDS-8                8      33.8    13.2
     ImageNet-CNDS-13              13      31.8    11.8

Table 1: ILSVRC 2012-2014 results: top-1 and top-5 error (%)
on the validation set. (a) Top single-model results from the
literature (see text for additional results and discussion).
(b) Results for our models.


Methods                   top-1 val/test    top-5 val/test

Places-CNN-5 [13]          50.4 / 50.0       80.9 / 81.1
Places-GoogleNet [1]          - / 56.3          - / 86.0
Places-CNN-8 (ours)        54.0 / 53.8       83.7 / 83.6
Places-CNDS-8 (ours)       54.7 / 55.7       84.1 / 85.8